In [28]:

    
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import dateutil.parser
from dateutil import parser
from datetime import datetime

First, I made a mistake naming the data set! It's 2015 data, not 2014 data. But yes, still use 311-2014.csv. You can rename it.

Importing and preparing your data

Import your data, but only the first 200,000 rows. You'll also want to change the index to be a datetime based on the Created Date column - you'll want to check if it's already a datetime, and parse it if not.



In [30]:

    
complaint_df = pd.read_csv("311-2014.csv", parse_dates = True).head(200000)









    



/Users/Monica/.virtualenvs/dataanalysis/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2723: DtypeWarning: Columns (8,17,40,41,42,43,44,45,46,47,48,49) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)



In [32]:

    
complaint_df.dtypes









    Out[32]:





Unique Key                          int64
Created Date                       object
Closed Date                        object
Agency                             object
Agency Name                        object
Complaint Type                     object
Descriptor                         object
Location Type                      object
Incident Zip                       object
Incident Address                   object
Street Name                        object
Cross Street 1                     object
Cross Street 2                     object
Intersection Street 1              object
Intersection Street 2              object
Address Type                       object
City                               object
Landmark                           object
Facility Type                      object
Status                             object
Due Date                           object
Resolution Description             object
Resolution Action Updated Date     object
Community Board                    object
Borough                            object
X Coordinate (State Plane)        float64
Y Coordinate (State Plane)        float64
Park Facility Name                 object
Park Borough                       object
School Name                        object
School Number                      object
School Region                      object
School Code                        object
School Phone Number                object
School Address                     object
School City                        object
School State                       object
School Zip                         object
School Not Found                   object
School or Citywide Complaint      float64
Vehicle Type                       object
Taxi Company Borough               object
Taxi Pick Up Location              object
Bridge Highway Name                object
Bridge Highway Direction           object
Road Ramp                          object
Bridge Highway Segment             object
Garage Lot Name                    object
Ferry Direction                    object
Ferry Terminal Name                object
Latitude                          float64
Longitude                         float64
Location                           object
dtype: object



In [8]:

    
import datetime
# complaint_df['Created Date'] = complaint_df['Created Date'].astype(dateutil.parser.parse)



In [34]:

    
complaint_df['Created Date'] = complaint_df['Created Date'].apply(dateutil.parser.parse)



In [36]:

    
complaint_df['Created Date'].describe()









    Out[36]:





count                  200000
unique                 177667
top       2015-05-04 00:00:00
freq                     1034
first     2015-01-01 00:08:34
last      2016-01-01 00:00:00
Name: Created Date, dtype: object



In [38]:

    
complaint_df['Created Date'].head()









    Out[38]:





0   2015-07-06 10:58:27
1   2015-07-03 13:26:29
2   2015-11-09 03:55:09
3   2015-07-03 02:18:32
4   2015-07-04 00:03:27
Name: Created Date, dtype: datetime64[ns]



In [40]:

    
#import datetime

#dt = datetime.datetime.strptime(string_date, fmt)



In [42]:

    
#import numpy as np
# Build a function using that method
#str_time = "07/06/2015 10:58:27 AM"
#def time_to_datetime(str_time):
 #   try:
   #     dt = parser.parse(str_time)
  #      return dt
   # except:
    #    return np.nan
#time_to_datetime(str_time)



In [44]:

    
# complaint_df['Created Date'] = complaint_df['Created Date'].apply(time_to_datetime)



In [46]:

    
complaint_df['Created Date'].describe()









    Out[46]:





count                  200000
unique                 177667
top       2015-05-04 00:00:00
freq                     1034
first     2015-01-01 00:08:34
last      2016-01-01 00:00:00
Name: Created Date, dtype: object

What was the most popular type of complaint, and how many times was it filed?



In [ ]:

    
complaint_df.groupby('Complaint Type')['Complaint Type'].count().sort_values(ascending=False).head(1)

Make a horizontal bar graph of the top 5 most frequent complaint types.



In [48]:

    
import matplotlib.pyplot as plt
%matplotlib inline
freqNYcomplaints = complaint_df.groupby('Complaint Type')['Complaint Type'].count().sort_values(ascending=False).head(5).plot(kind="bar", color = ['#624ea7', '#599ad3', '#f9a65a', '#9e66ab', 'purple'])
freqNYcomplaints.set_title('Most Frequent NYC Complaints')
freqNYcomplaints.set_xlabel('Type of Complaint')
freqNYcomplaints.set_ylabel('Frequency of Complaint')
plt.savefig('freqNYcomplaints.png')

Which borough has the most complaints per capita? Since it's only 5 boroughs, you can do the math manually.



In [ ]:

    
complaint_df.head()

According to your selection of data, how many cases were filed in March? How about May?



In [56]:

    
#confirm this is a date type 
# complaint_df['Created Date']



In [ ]:



In [55]:

    
complaint_df.index = complaint_df['Created Date']
# complaint_df.head()



In [199]:

    
#complaint_df = complaint_df.set_index(['Created Date'])
#complaint_df['Created Date'] = pd.to_datetime(complaint_df['Created Date'])
ax = complaint_df['2015-03'].resample('D').count().plot(y='Unique Key')
# complaint_df['2015-04']['Complaint Type'].count()
ax.set_title('NYC Complaints in March 2015')
ax.set_xlabel('Day of the Month')
ax.set_ylabel('Number of Complaints')
plt.savefig('MarchfreqNYcomplaints.png')



In [200]:

    
ax = complaint_df['2015-05'].resample('D').count().plot(y='Unique Key')
ax.set_title('NYC Complaints in May 2015')
ax.set_xlabel('Day of the Month')
ax.set_ylabel('Number of Complaints')
plt.savefig('MayfreqNYcomplaints.png')



In [58]:

    
complaint_df.head()









    Out[58]:






  
    
      
      Unique Key
      Created Date
      Closed Date
      Agency
      Agency Name
      Complaint Type
      Descriptor
      Location Type
      Incident Zip
      Incident Address
      ...
      Bridge Highway Name
      Bridge Highway Direction
      Road Ramp
      Bridge Highway Segment
      Garage Lot Name
      Ferry Direction
      Ferry Terminal Name
      Latitude
      Longitude
      Location
    
    
      Created Date
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      2015-07-06 10:58:27
      31015465
      2015-07-06 10:58:27
      07/22/2015 01:07:20 AM
      DCA
      Department of Consumer Affairs
      Consumer Complaint
      Demand for Cash
      NaN
      11360
      27-16 203 STREET
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      40.773540
      -73.788237
      (40.773539552542, -73.78823697228408)
    
    
      2015-07-03 13:26:29
      30997660
      2015-07-03 13:26:29
      07/03/2015 02:08:20 PM
      NYPD
      New York City Police Department
      Vending
      In Prohibited Area
      Residential Building/House
      10019
      200 CENTRAL PARK SOUTH
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      40.767021
      -73.979448
      (40.76702142171206, -73.97944780718524)
    
    
      2015-11-09 03:55:09
      31950223
      2015-11-09 03:55:09
      11/09/2015 08:08:57 AM
      NYPD
      New York City Police Department
      Blocked Driveway
      No Access
      Street/Sidewalk
      10453
      1993 GRAND AVENUE
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      40.852671
      -73.910608
      (40.85267061877697, -73.91060771362552)
    
    
      2015-07-03 02:18:32
      31000038
      2015-07-03 02:18:32
      07/03/2015 07:54:48 AM
      NYPD
      New York City Police Department
      Noise - Commercial
      Loud Music/Party
      Club/Bar/Restaurant
      11372
      84-16 NORTHERN BOULEVARD
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      40.755774
      -73.883262
      (40.755773786469966, -73.88326243225418)
    
    
      2015-07-04 00:03:27
      30995614
      2015-07-04 00:03:27
      07/04/2015 03:33:09 AM
      NYPD
      New York City Police Department
      Noise - Street/Sidewalk
      Loud Talking
      Street/Sidewalk
      11216
      1057 BERGEN STREET
      ...
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      40.676175
      -73.951269
      (40.67617516102934, -73.9512690004692)
    
  

5 rows × 53 columns

I'd like to see all of the 311 complaints called in on April 1st.

Surprise! We couldn't do this in class, but it was just a limitation of our data set



In [67]:

    
print("There were", complaint_df['2015-04-01']['Unique Key'].count(), "complaints on April 1st.")









    



There were 573 complaints on April 1st.

What was the most popular type of complaint on April 1st?

What were the most popular three types of complaint on April 1st



In [71]:

    
complaint_df['2015-04-01'].groupby('Complaint Type')['Complaint Type'].count().sort_values(ascending=False).head(1)









    Out[71]:





Complaint Type
Illegal Parking    67
Name: Complaint Type, dtype: int64



In [74]:

    
complaint_df['2015-04-01'].groupby('Complaint Type')['Complaint Type'].count().sort_values(ascending=False).head(3)









    Out[74]:





Complaint Type
Illegal Parking     67
Street Condition    64
Blocked Driveway    58
Name: Complaint Type, dtype: int64

What month has the most reports filed? How many? Graph it.



In [77]:

    
complaint_df.resample('M')['Complaint Type'].count().sort_values(ascending=False).head(1)









    Out[77]:





Created Date
2015-05-31    49715
Name: Complaint Type, dtype: int64



In [201]:

    
ax = complaint_df['2015-05'].resample('D').count().plot(y='Unique Key')
ax.set_title('NYC Complaints in May 2015')
ax.set_xlabel('Day of the Month')
ax.set_ylabel('Number of Complaints')
plt.savefig('MayfreqNYcomplaints.png')

What week of the year has the most reports filed? How many? Graph the weekly complaints.



In [97]:

    
complaint_df.resample('W')['Complaint Type'].count().sort_values(ascending=False).head(3)









    Out[97]:





Created Date
2015-05-10    13559
2015-05-17    11683
2015-05-24    10351
Name: Complaint Type, dtype: int64



In [102]:

    
complaint_df['2015-05-10'].resample('W')['Complaint Type'].count().plot(y='Unique Key')









    



/Users/Monica/.virtualenvs/dataanalysis/lib/python3.5/site-packages/matplotlib/axes/_base.py:2787: UserWarning: Attempting to set identical left==right results
in singular transformations; automatically expanding.
left=2367.0, right=2367.0
  'left=%s, right=%s') % (left, right))






    Out[102]:





<matplotlib.axes._subplots.AxesSubplot at 0x11cb0fef0>

Noise complaints are a big deal. Use .str.contains to select noise complaints, and make an chart of when they show up annually. Then make a chart about when they show up every day (cyclic).



In [202]:

    
ax = complaint_df['Complaint Type'].str.contains('Noise').resample('W').count().plot(y='Unique Key')
ax.set_title('NYC Complaints in 2015')
ax.set_xlabel('Month')
ax.set_ylabel('Number of Complaints')
plt.savefig('AnnualfreqNYcomplaints.png')



In [210]:

    
ax = complaint_df['Complaint Type'].str.contains('Noise').groupby(by=complaint_df.index.hour).mean().plot(y='Unique Key')

ax.set_title('Daily NYC SoundComplaints in 2015')
ax.set_xlabel('Number of Complaints')
plt.savefig('DailyfreqNYSoundcomplaints.png')

#ax.set_xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])
#ax.set_xticklabels([12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])



In [ ]:

Which were the top five days of the year for filing complaints? How many on each of those days? Graph it.



In [124]:

    
complaint_df.resample('W')['Complaint Type'].count().sort_values(ascending=False).head(5)









    Out[124]:





Created Date
2015-05-10    13559
2015-05-17    11683
2015-05-24    10351
2015-05-03    10184
2015-05-31     9387
Name: Complaint Type, dtype: int64



In [176]:

    
#may_17 = complaint_df['2015-05-17']
#ax = may_17.groupby(by=may_17.index.hour).count().plot(y='Unique Key')



In [183]:

    
may_17 = complaint_df['2015-05-17']
ax = may_17.groupby(by=may_17.index.hour).count().plot(y='Unique Key', label="May 17th")
may_24 = complaint_df['2015-05-24']
may_24.groupby(by=may_24.index.hour).count().plot(y='Unique Key', ax=ax, label = "May 24th")
may_03 = complaint_df['2015-05-03']
may_03.groupby(by=may_03.index.hour).count().plot(y='Unique Key', ax=ax, label = "May 3rd")
may_10 = complaint_df['2015-05-10']
may_10.groupby(by=may_10.index.hour).count().plot(y='Unique Key', ax=ax, label = "May 10th")
may_31 = complaint_df['2015-05-31']
may_31.groupby(by=may_31.index.hour).count().plot(y='Unique Key', ax=ax, label = "May 31st")
ax.set_xticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])
ax.set_xticklabels([12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])









    Out[183]:





[<matplotlib.text.Text at 0x17332efd0>,
 <matplotlib.text.Text at 0x173333080>,
 <matplotlib.text.Text at 0x173364080>,
 <matplotlib.text.Text at 0x17335cd68>,
 <matplotlib.text.Text at 0x173369320>,
 <matplotlib.text.Text at 0x173369d68>,
 <matplotlib.text.Text at 0x1733746a0>,
 <matplotlib.text.Text at 0x173369908>,
 <matplotlib.text.Text at 0x173355978>,
 <matplotlib.text.Text at 0x17338d2b0>,
 <matplotlib.text.Text at 0x17335c5f8>,
 <matplotlib.text.Text at 0x1733170b8>,
 <matplotlib.text.Text at 0x173322ef0>,
 <matplotlib.text.Text at 0x173325160>,
 <matplotlib.text.Text at 0x1733a9748>,
 <matplotlib.text.Text at 0x1733a9f60>,
 <matplotlib.text.Text at 0x1733ad7b8>,
 <matplotlib.text.Text at 0x1733ae1d0>,
 <matplotlib.text.Text at 0x1733aec18>,
 <matplotlib.text.Text at 0x1733b26a0>,
 <matplotlib.text.Text at 0x1733b5128>,
 <matplotlib.text.Text at 0x1733b5b70>,
 <matplotlib.text.Text at 0x1733ba5f8>,
 <matplotlib.text.Text at 0x1733bc080>]



In [174]:

    
ax









    Out[174]:





<matplotlib.axes._subplots.AxesSubplot at 0x171ae2198>



In [182]:

    
# ax = complaint_df['2015-05-10'].resample('H').count().plot(y='Unique Key')
#ax = complaint_df['2015-05-17'].groupby(by=complaint_df.index.hour).count().plot(y='Unique Key', label = "May 17th")
# complaint_df['2015-05-24'].resample('H')['Complaint Type'].count().plot(y='Unique Key', ax=ax, label = "May 24th")
# complaint_df['2015-05-03'].resample('H')['Complaint Type'].count().plot(y='Unique Key', ax=ax, label = "May 3rd")
# complaint_df['2015-05-31'].resample('H')['Complaint Type'].count().plot(y='Unique Key', ax=ax, label = "May 31st")
# # the least sold in a month (e.g., the worst January ever , the worst July ever, etc)
# complaint_df['2015-05-17'].resample('H')['Complaint Type'].count().plot(y='Unique Key', ax=ax, label = 'May 17th')

# the most sold in a month (e.g., the best January ever , the best July ever, etc)
#df.groupby(by=df.index.month).max().plot(y='val', ax=ax, label = 'Most Sold')

What hour of the day are the most complaints? Graph a day of complaints.



In [197]:

    
#make a graph of the average 
ax = complaint_df.groupby(by=complaint_df.index.hour).mean().plot(y='Unique Key', label='Average')
#x_values = complaint_df.groupby(by=complaint_df.index.hour).median().index
# get the minimum
#min_values = complaint_df.groupby(by=complaint_df.index.hour)['Unique Key'].min()
#get maxs
#max_values = complaint_df.groupby(by=complaint_df.index.hour)['Unique Key'].max()
#ax.fill_between(x_values, min_values, max_values, alpha=0.3)
#ax.set_title("Fluctuation of House Sales")



In [192]:

    
complaint_df.groupby(by=complaint_df.index.hour)['Unique Key'].median()









    Out[192]:





0     30614734
1     30867364
2     30793150
3     30785415
4     31058530
5     30853700
6     31021120
7     30812242
8     30782634
9     30776134
10    30766641
11    30774043
12    30780706
13    30775608
14    30766844
15    30747919
16    30746354
17    30800853
18    30752702
19    30804272
20    30945024
21    30924831
22    30893587
23    30857818
Name: Unique Key, dtype: int64

One of the hours has an odd number of complaints. What are the most common complaints at that hour, and what are the most common complaints the hour before and after?



In [214]:

    
ax = complaint_df['Complaint Type'].str.contains('04:').resample('T').count().plot(y='Unique Key')



In [ ]:



In [ ]:

So odd. What's the per-minute breakdown of complaints between 12am and 1am? You don't need to include 1am.



In [ ]:

Looks like midnight is a little bit of an outlier. Why might that be? Take the 5 most common agencies and graph the times they file reports at (all day, not just midnight).



In [ ]:

Graph those same agencies on an annual basis - make it weekly. When do people like to complain? When does the NYPD have an odd number of complaints?



In [ ]:

Maybe the NYPD deals with different issues at different times? Check the most popular complaints in July and August vs the month of May. Also check the most common complaints for the Housing Preservation Bureau (HPD) in winter vs. summer.



In [ ]:



In [ ]:



In [ ]:



In [ ]:



In [ ]:

	Unique Key	Created Date	Closed Date	Agency	Agency Name	Complaint Type	Descriptor	Location Type	Incident Zip	Incident Address	...	Bridge Highway Name	Bridge Highway Direction	Road Ramp	Bridge Highway Segment	Garage Lot Name	Ferry Direction	Ferry Terminal Name	Latitude	Longitude	Location
Created Date
2015-07-06 10:58:27	31015465	2015-07-06 10:58:27	07/22/2015 01:07:20 AM	DCA	Department of Consumer Affairs	Consumer Complaint	Demand for Cash	NaN	11360	27-16 203 STREET	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40.773540	-73.788237	(40.773539552542, -73.78823697228408)
2015-07-03 13:26:29	30997660	2015-07-03 13:26:29	07/03/2015 02:08:20 PM	NYPD	New York City Police Department	Vending	In Prohibited Area	Residential Building/House	10019	200 CENTRAL PARK SOUTH	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40.767021	-73.979448	(40.76702142171206, -73.97944780718524)
2015-11-09 03:55:09	31950223	2015-11-09 03:55:09	11/09/2015 08:08:57 AM	NYPD	New York City Police Department	Blocked Driveway	No Access	Street/Sidewalk	10453	1993 GRAND AVENUE	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40.852671	-73.910608	(40.85267061877697, -73.91060771362552)
2015-07-03 02:18:32	31000038	2015-07-03 02:18:32	07/03/2015 07:54:48 AM	NYPD	New York City Police Department	Noise - Commercial	Loud Music/Party	Club/Bar/Restaurant	11372	84-16 NORTHERN BOULEVARD	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40.755774	-73.883262	(40.755773786469966, -73.88326243225418)
2015-07-04 00:03:27	30995614	2015-07-04 00:03:27	07/04/2015 03:33:09 AM	NYPD	New York City Police Department	Noise - Street/Sidewalk	Loud Talking	Street/Sidewalk	11216	1057 BERGEN STREET	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	40.676175	-73.951269	(40.67617516102934, -73.9512690004692)